Merge dev into main for 5.0 release. by tfenne · Pull Request #1779 · samtools/htsjdk

tfenne · 2026-05-01T19:25:58Z

Description

Merge dev into main for 5.0 release. All code has previously been through PRs into dev.

#1758) * Fix SamLocusIterator so that read position is not incorrectly offset after transiting insertions that fail the base quality check.

… improvement in performance (#1764) * Optimize BAM nibble-to-ASCII base decoding with bulk lookup table. Replace the per-nibble method call chain in SAMUtils.compressedBasesToBytes with a 512-byte pre-computed lookup table (NIBBLE_PAIR_LOOKUP) that decodes two bases per iteration. Ported from htslib's code2base approach. The previous implementation made 4 method calls per base pair, each involving a try/catch and array lookup. The new version does a single table lookup per compressed byte with no method calls, halving loop iterations. Benchmarks show ~7% speedup on pure BAM reading. Remove dead methods compressedBaseToByte, compressedBaseToByteLow, and compressedBaseToByteHigh which are no longer called. Add tests for odd-length sequences, single base, empty sequence, offset, and round-trip encoding. * Eliminate redundant reference name/index resolution during BAM decoding. BAMRecord construction was performing ~10 dictionary lookups per record (index->name->index round-trips for both ref and mate, then again via a redundant setHeader() call in BAMRecordCodec.decode()). This reduces it to 2 lookups per record (one resolveNameFromIndex each for ref and mate) by: - Adding package-private setReferenceNameAndIndex/setMateReferenceNameAndIndex methods to SAMRecord that set both fields directly - Using these in the BAMRecord constructor instead of setReferenceIndex/ setMateReferenceIndex which round-trip through name resolution - Removing the redundant setHeader(header) call in BAMRecordCodec.decode() since the header is already set in the BAMRecord constructor Profiling showed reference resolution at ~8% of CPU samples during BAM reading. Benchmarking on a 12.3GB BAM (207M records) confirmed a ~7-8% wall-clock improvement (104.2s -> 96.2s). * Optimize BAM tag decoding: single-pass string reads and skip redundant validation. Two targeted optimizations to the BAM tag decoding path: 1. BinaryTagCodec.readNullTerminatedString: replace the double-pass approach (scan forward for null, reset, re-read) with a single-pass scan over the backing byte array, eliminating the intermediate byte[] allocation per string tag. 2. SAMBinaryTagAndValue/SAMBinaryTagAndUnsignedArrayValue: add package-private constructors that skip isAllowedAttributeValue() validation, used from BinaryTagCodec.readTags() where the value types are known to be valid from the BAM type codes. Profiling with eager decode showed tag decoding at 18.2% of CPU. Benchmarking on a 3.9GB BAM (52M records) showed a ~3% improvement in the eager-decode path (40.0s -> 38.9s) with no regression in the lazy path.

#1768) * Integrate jlibdeflate for faster DEFLATE compression and decompression Add optional jlibdeflate (libdeflate JNI bindings) support that is automatically used when the native library is available on the classpath, with transparent fallback to the JDK Inflater/Deflater. - DeflaterFactory/InflaterFactory auto-detect jlibdeflate availability at first use, caching the result for the lifetime of the JVM - LibdeflateDeflater/LibdeflateInflater extend JDK Deflater/Inflater and correctly handle both raw DEFLATE (nowrap=true) and zlib format (nowrap=false) - New Defaults.USE_LIBDEFLATE (samjdk.use_libdeflate) defaults to true; set to false to force JDK-only compression - jlibdeflate 0.1.0-SNAPSHOT added as a dependency from Sonatype Central snapshots

…build (#1769) * Update maven central publishing to the new central portal and document release publication steps in CONTRIBUTING.md * Upgrade Gradle to 9.4.1, and latest release version of all plugins; fix build deprecations

…1753) Add UnsatisfiedLinkError to the catch clause in tryToLoadSnappy() to handle cases where native library loading fails (e.g., musl/glibc detection issues in snappy-java 1.1.10.8+).

Initial implementation of CRAM 3.1 write support, contributed by Chris Norman. Adds a naive write profile that writes valid CRAM 3.1 files without yet using any of the format's new specialised codecs (FQZComp, Name Tokeniser, RangeCoder, etc.); all data series continue to use CRAM 3.0 codecs (rANS, GZIP). This squashes the four prior commits: - Enable a naive CRAM 3.1 write profile. - Changes based on instrumented trial runs. - CRAM 3.1 write tests and code cleanup. - Temp fix for preSorted=false default. This commit serves as the foundation that the following commit builds on to add the full CRAM 3.1 codec set, profile system, and performance optimisations. Co-authored-by: Tim Fennell <tim@fulcrumgenomics.com>

Builds out CRAM 3.1 write support on top of cnorman's initial naive profile, taking htsjdk to feature parity with samtools/htslib for writing CRAM 3.1 and matching samtools on cross-implementation fidelity across a broad range of real-world data. CRAM 3.1 codecs and supporting infrastructure --------------------------------------------- - Full CRAM 3.1 codec set: rANS Nx16 (Order 0/1, RLE, Stripe, Pack, Cat), FQZComp (quality scores), Range adaptive arithmetic coder, and the Name Tokeniser. - New compression profile system (FAST, NORMAL, SMALL, ARCHIVE), matching htslib's `--output-fmt-option fast/small/archive`. Each profile picks per-DataSeries compressors and (for SMALL/ARCHIVE) trial-compression candidate sets. - TrialCompressor: a wrapper that tries multiple compressors per block and keeps the smallest output. Replaces the ad-hoc tag triple- compression path with a uniform, profile-driven mechanism. - DataSeries content IDs aligned with htslib so CRAM dumps from the two implementations are directly comparable. Codec performance work ---------------------- - rANS codecs reworked to use a `byte[]` API and backwards-write, removing per-byte stream overhead and several layers of indirection. - GzipCodec uses Deflater/Inflater directly instead of going through GZIPInputStream/OutputStream, avoiding gzip framing overhead per block. - Name Tokeniser encoder: hand-written tokenisation replaces regex, per-type flags + STRIPE + duplicate-stream elimination + all-MATCH fast path significantly improve both speed and ratio. - FQZComp / Range coder / rANS encoder hot paths tightened. - Pooled RANSNx16Decode instance in the Name Tokeniser to avoid reallocating per call. - Replaced ByteArrayInputStream/OutputStream with unsynchronized CRAMByteReader/Writer to remove monitor overhead in tight loops. - Cached SAMTag key metadata to eliminate per-record String allocation during CRAM decode. - Fused read-base restoration, CIGAR building, and NM/MD computation into a single pass instead of three. - jlibdeflate compatibility fix on the ByteBuffer path. - Roughly 15% faster encoding overall vs the pre-optimisation state, with read decoding gains in line. Correctness fixes found during cross-implementation validation -------------------------------------------------------------- - TLEN is now computed using htslib's coordinate-extent rule (max(end1,end2) - min(start1,start2) + 1, signed by leftmost position) for CRAM-specific compute, rather than the SAM 5'-to-5' rule. Without this, every read with a supplementary alignment mismatched on TLEN through any CRAM round-trip. - CompressionHeader serialisation now uses a growable ByteArrayOutputStream rather than a fixed 100 KB ByteBuffer for the preservation map and tag encoding map. The TD tag dictionary alone can exceed 100 KB for rich tag sets (PacBio/Ultima flow-space, ONT modified-base tags) under the archive profile, where it would previously throw BufferOverflowException. - Slice flush now uses a dual record/bases threshold matching htslib's `seqs_per_slice` AND `bases_per_slice` (default = readsPerSlice * 500). Without this, archive-profile encoding of long-read data (PacBio HiFi 15 kb, ONT 30 kb+) would buffer ~1+ GB of quality scores per slice and OOM the FQZComp encoder. - Strip NM/MD on encode and regenerate them on decode (matching htslib's default `store_nm=0`/`store_md=0` behaviour). Implemented attached mate pairs to align the in-memory representation with the spec. - Restored CIGAR reconstruction when SEQ is `*` (CF_UNKNOWN_BASES). - Fixed crash when reading containers that contain no slices. - Removed unnecessary content digest tags from CRAM slice headers (htslib doesn't write them either; htsjdk's verification on read was overly strict). Tools ----- - `CramConverter`: a small command-line tool for converting between SAM/BAM/CRAM, primarily for benchmarking and exercising profiles. - `CramStats`: dump per-block compression statistics from a CRAM file for debugging and ratio analysis. Tests and CI ------------ - New hts-specs CRAM 3.0/3.1 compliance tests covering decode, index query, and round-trip behaviour. - FQZComp round-trip tests using the hts-specs quality data files. - CRAI index query correctness tests; codec roundtrip property tests. - Split CRAM31 fidelity tests into per-profile classes for parallel execution. - Reduced memory pressure in the unit tests to eliminate intermittent OOM failures on CI. - Sped up several long-running CRAM tests by caching test data, reducing slice-size matrices, and downsampling fixtures. - Misc: fixed a thread-safety bug in VariantContextTestProvider that was producing non-deterministic test counts. CHANGELOG / README updated for the 5.0.0 release. Co-authored-by: Chris Norman <cnorman@broadinstitute.org>

Snakemake pipeline that round-trips a curated set of public BAM/CRAM files through htsjdk and samtools and verifies output equivalence across all four CRAM 3.1 compression profiles (FAST, NORMAL, SMALL, ARCHIVE) and all four encode/decode combinations. The pipeline was used to validate the CRAM 3.1 write implementation in the previous commit against samtools 1.21 on real-world data spanning five sequencing platforms (Illumina, PacBio HiFi, ONT, Element AVITI, Ultima), seven assay types (WGS, WGBS, RNA-seq, scRNA-seq, exome, Hi-C, amplicon), and two organisms. Pipeline (validation/) ---------------------- - Snakefile: encode original -> CRAM with both htsjdk and samtools, decode each CRAM with both, then compare each decoded BAM against the original. 4 profiles * 4 encoder/decoder combinations = 16 comparisons per sample. - Two sample-sheet formats: samples.tsv - local file paths (3 columns) test_datasets.tsv - remote URLs auto-downloaded (9 columns) Format is auto-detected from header. Reference URLs may be a comma-separated list, with optional `#contig_name` suffix to rename a single-sequence FASTA's header on download (e.g. Lambda phage chrL spike-in for bisulfite samples). - Intermediate files (BAMs and CRAMs) are marked temp() so they are cleaned up as soon as their dependents finish; rule priorities push the DAG to drain depth-first, keeping peak disk bounded. - Per-rule log directives, retry/escalating-memory ladder for the Java jobs (4 GB -> 6 GB -> 8 GB -> 10 GB), and pixi.toml for a reproducible toolchain. CramComparison -------------- A general-purpose record-by-record SAM/BAM/CRAM comparison utility used by the pipeline as the comparator. Tolerates the universal CRAM roundtrip normalisations: - CIGAR =/X collapsed to M before comparison (CRAM stores match/ mismatch in read features, so the =/X distinction is lost on decode in both htsjdk and samtools). - mapQ ignored for unmapped reads (SAM spec leaves it undefined, and both implementations normalise it to 0 on decode). - Auto-generated MD/NM tags stripped when one side is missing them. - Unsigned B-array type differences tolerated (CRAM stores signed). Exit code 0 on both match and mismatch; exit 2 only on actual error, so Snakemake preserves the result file in either case. Co-authored-by: Claude <noreply@anthropic.com>

…ields (#1762) * Changed SAMRecord.toString() to emit the SAM format string with all fields. * Drop sync + trailing newline from SAMRecord.getSAMString. * Remove deprecated SAMRecord.format() and now-dead helpers. The text formatting path used by SAMRecord.toString() / getSAMString() has been cleaned up: - SAMTextWriter.getSAMString no longer uses a JVM-wide synchronized static cache of a shared StringWriter/SAMTextWriter. It now allocates a fresh StringWriter per call. Without this change, every toString() call would have taken a global monitor. - SAMTextWriter.writeAlignment is split into a private writeAlignmentNoNewline plus a thin wrapper that appends '\n'. getSAMString uses the no-newline variant, so the result no longer needs to be trimmed. BREAKING CHANGE: SAMRecord.getSAMString() no longer terminates the returned String with a newline character. This brings SAMRecord into line with the other getSAMString() implementations (SAMSequenceRecord, SAMReadGroupRecord, SAMProgramRecord, SAMFileHeader), which already return newline-free strings. Callers that relied on the trailing '\n' as a separator (e.g. concatenating two records' SAM strings) must now insert their own separator. Callers that stripped the newline with .trim() can drop the call. BREAKING CHANGE: SAMRecord.format() has been removed. Callers should use SAMRecord.getSAMString() instead.

* Make snapshot version names reflect the next planned release. Previously the build appended "-SNAPSHOT" to git-describe output, so all snapshots between releases were named after the *previous* release (e.g. "4.3.0-3-gabcdef0-SNAPSHOT"), the opposite of Maven convention where SNAPSHOTs are previews of the next release. The fix: declare a `nextVersionBump` ("x" / "x.x" / "x.x.x") at the top of build.gradle, and compute the next semver from the most recent release tag plus that bump. Snapshots are then named "<nextVersion>-<shortHash>-SNAPSHOT" (e.g. "5.0.0-23c681a8dc-SNAPSHOT"); including the hash makes each snapshot a distinct, pinnable artifact. Release builds (-Drelease=true) require HEAD to be on a semver tag and the working tree to be clean; the tag is the version (nextVersionBump is ignored on release — the tag is authoritative). nextVersionBump is set to "x" since the next release will be a major one. * Remove dead Ant build and Travis CI config.

This is a mechanical reformat of every .java file under src/ via the Palantir Java Format style. There are no behavior changes. The follow-up commit adds the Spotless Gradle plugin that produced this output, wires it into the build as an enforcement step, updates the contributor docs, deletes the obsolete java-style-eclipse.xml and java-style-intellij.xml style guides, and adds .git-blame-ignore-revs so this commit is skipped by `git blame`. The single non-mechanical change in this commit is the removal of one stray "// the imports for unit testing." comment from VariantContextUnitTest.java that was preventing the formatter from treating the surrounding imports as a contiguous block. The comment carried no information. Part of #1761.

…1761) All build, CI, and contributor-facing changes for the new code-style enforcement. The bulk reformat itself is in the previous commit; this commit only contains setup. build.gradle: - Add the com.diffplug.spotless plugin (8.4.0), configured with palantirJavaFormat() over src/**/*.java. There are no formatter knobs -- the formatter is the style guide. - Make compileJava depend on spotlessJavaApply so any unformatted source is auto-rewritten in place during the build. Contributors don't have to remember to run a format command -- just build, and the source ends up formatted. spotlessJavaApply is sub-second warm thanks to Gradle's up-to-date checking, so the cost on every build is negligible. .github/workflows/tests.yml: - Add a `formatCheck` CI job running `./gradlew spotlessCheck` (verify-only, no mutation) so unformatted code can't slip past CI. The local auto-format is a convenience; CI is the enforcement boundary. CONTRIBUTING.md: - New "Code Style" section: how the auto-format works, when/why you might still want to run `spotlessApply` directly, that CI verifies rather than rewrites, and how to opt in to .git-blame-ignore-revs locally. .git-blame-ignore-revs: - New file pointing at the bulk-format commit so `git blame` (and the GitHub web blame view, which honors the file automatically) credits the original author of each line rather than the reformat commit. Deletions: - java-style-eclipse.xml and java-style-intellij.xml -- stale, unreferenced by the build, and inconsistent with Palantir's output. Closes #1761.

The NCBI SRA reader, backed by the gov.nih.nlm.ncbi:ngs-java native library, has been a recurring source of build/runtime friction and is being removed for v5.0.0. Downstream consumers that need SRA must implement it themselves or use a different library. Removed: - gov.nih.nlm.ncbi:ngs-java api dependency from build.gradle. - The "sra" TestNG group from build.gradle's test and testExternalApis tasks (and from htsjdk.TestDataProviders.EXCLUDED_GROUPS). - The -Dsamjdk.sra_libraries_download=true JVM arg from testExternalApis. The corresponding Defaults.SRA_LIBRARIES_DOWNLOAD field is removed. - 10 main sources under htsjdk/samtools and htsjdk/samtools/sra/ (SRAFileReader, SRAIterator, SRAIndex, SRAAccession, SRALazyRecord, SRAAlignmentIterator, SRAUnalignmentIterator, SRAUtils, ReferenceCache, SRAIndexedSequenceFile). - 7 test classes under htsjdk/samtools/sra/ and the test_archive.sra resource. - SamReader.Type.SRA_TYPE. - The README's SRA license-attribution sentence. Modified (SRA hooks removed): - SamInputResource: drops InputResource.Type.SRA_ACCESSION, the abstract asSRAAccession() method and all its overrides, the SRAInputResource inner class, and the SamInputResource.of(SRAAccession) factory. - SamReaderFactory: drops the SRA dispatch in the resource type switch, the isSra() autodetection branch, the abstract applyTo(SRAFileReader, ...) method on Option and all five overrides of it, plus the instanceof SRAFileReader branch. - SamReaderFactoryTest, SamReaderTest, ReadsBundle (TODO comment): minor follow-on edits where these referenced the removed types. Verification: - ./gradlew compileJava compileTestJava: passes. - ./gradlew test: 21,877 / 21,877 pass (down from 21,936 on dev; the difference is the removed sra test classes plus DataProvider-discovered test paths that referenced the sra group). - ./gradlew spotbugsMain spotbugsTest: clean, no new findings. - ./gradlew spotlessCheck: clean (autoformat applied during compile). - Stragglers: grep'd src/, build.gradle, README, CONTRIBUTING, CHANGELOG, and .github/ for "SRA" tokens, "import ngs.", "sra" string literals, "ngs-java", and "sra_libraries" -- no remaining references. CHANGELOG entry deferred to the dedicated v5.0.0 CHANGELOG task.

) Slim htsjdk's published runtime classpath by making the JavaScript engine an opt-in dependency, and clean up around the change. Net effect for consumers of the v5.0.0 artifact: 6 fewer jars on the runtime classpath (nashorn-core + 5 ASM transitives, ~2.5 MB) and one less "misleading direct dependency" in the published POM. ** BREAKING CHANGE ** Consumers using htsjdk.samtools.filter.JavascriptSamRecordFilter or htsjdk.variant.variantcontext.filter.JavascriptVariantFilter must now add a JSR-223 "js" engine to their own runtime classpath. The recommended choice is OpenJDK Nashorn: Gradle: runtimeOnly 'org.openjdk.nashorn:nashorn-core:15.7' Maven: <dependency> <groupId>org.openjdk.nashorn</groupId> <artifactId>nashorn-core</artifactId> <version>15.7</version> <scope>runtime</scope> </dependency> If no engine is on the classpath at runtime, AbstractJavascriptFilter's constructor now throws a RuntimeScriptException whose message names the calling filter class and prints both Gradle and Maven coordinates, so the failure is fully self-describing. The breaking change will be called out in the dedicated v5.0.0 CHANGELOG / breaking-changes PR alongside the other v5.0.0 changes.

Expands the existing 5.0.0 stub (which previously covered only the CRAM 3.1 write work) into a complete entry covering everything since 4.3.0. Adds: - Lead headlines summarizing the major themes (CRAM 3.1 writing, slimmer runtime deps, faster BAM [de]compression, enforced formatting, fixed test reporting). - A prominent ⚠️ Breaking changes section calling out SRA removal, Nashorn now opt-in, the SAMRecord.toString() format change, the removed CRAM slice digest tags, and the new default CRAM version 3.1. - A new "CRAM correctness and cross-implementation fixes" section consolidating the read- and write-path fixes that improve interop with samtools/htslib (TLEN computation, CIGAR =/X comparison, CIGAR reconstruction for sequence '*', container-with-no-slices crash, archive header overflow, unmapped-read query, supplementary/secondary read-name limitation). - Performance entries beyond just the CRAM-internal optimizations: jlibdeflate integration, the BAM decoding path improvements, and a long-read-friendly bases-per-slice threshold. - A bug-fix section covering the LTF8 9-byte write fix, the SamLocusIterator offset bug, the SamPairUtil dovetail fix, and the snappy native-load UnsatisfiedLinkError catch. - A build, tooling, and dependency clean-up section: Palantir Java Format + Spotless enforcement, Maven Central portal migration, snapshot version naming, deprecation cleanup, the test-runner pass/fail-reporting fix, and the dependency clean-up (commons-logging constraint, Nashorn compileOnly, ngs-java removal). - A compatibility line noting JDK 17 / 21 / 24 spot-checks. - Expanded testing entries: the hts-specs CRAM 3.0/3.1 compliance tests, FQZComp round-trip tests, CRAI correctness tests, test-suite speedups, the CEUTrio test-data downsizing, and the JS filter test bulk-up. Source for the additions was the full git log since the 4.3.0 tag plus the unsquashed backup branch tf_cram_31_backup_20260425, which retains fine-grained commits that the merged CRAM 3.1 PR squashed away. The CRAM write-speed gains are intentionally not headlined yet -- prior htsjdk wrote CRAM 3.0 (lower compression, fewer codec passes), so "faster" without "and same/better compression" would be misleading. We'll revisit the perf bullet after benchmarking against samtools.

@return

Cleans up 100 javadoc warnings flagged by the javadoc target. Fixes fall into a few well-defined categories: unescaped angle brackets and ampersands in doc text, malformed @link references (missing braces, invalid this#, double-# chains), an unresolved package reference, an unclosed @code tag, @return tags on void methods and constructors, and a duplicate @param. Doc-only changes; no behavioral impact.

tfenne and others added 17 commits April 25, 2026 06:12

Fix SamLocusIterator so that read position is not incorrectly offset … (

bdd6791

#1758) * Fix SamLocusIterator so that read position is not incorrectly offset after transiting insertions that fail the base quality check.

Fix existing deprecations in code and build. (#1767)

2a5564a

fix: catch UnsatisfiedLinkError when loading snappy native library (#…

b06e6e4

…1753) Add UnsatisfiedLinkError to the catch clause in tryToLoadSnappy() to handle cases where native library loading fails (e.g., musl/glibc detection issues in snappy-java 1.1.10.8+).

tfenne merged commit fb57878 into master May 1, 2026
3 of 4 checks passed

tfenne deleted the dev branch May 1, 2026 19:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Merge dev into main for 5.0 release.#1779

Merge dev into main for 5.0 release.#1779
tfenne merged 17 commits into
masterfrom
dev

tfenne commented May 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

tfenne commented May 1, 2026

Description

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants